My PhD research programme is all about cladistic data congruence, compatibility, convergent evolution and phylogenetic tree-to-tree distance measures, while in parallel I work on a Panton Fellowship on data mining. The latter is about the mining of data direct from the literature and encouraging a culture of openness and data sharing. I shall be presenting this work at the pro-iBiosphere workshop in February 2013. My mentor for this project is Peter Murray-Rust, a Cambridge-based computational chemist and co-author of the Panton Principles for Open Data in Science. I'm fairly new to data mining techniques and machine learning methods, but am learning fast, and am certainly looking forward to meeting researchers at this event using similar techniques and methods for taxonomic data.
Specifically, I'm looking to extract phylogenetic tree data directly from the figures of phylogenetic papers, including the exact relationships between taxa, branch lengths and support values. Unlike data mining efforts that are entirely text-based, this requires extracting data from non-textual sources. Some attempts have already been made to do this with programs such as TreeThief, TreeRipper and TreeSnatcher, but in their current state none of these can realistically be applied systematically to tens of thousands of phylogenetic papers.
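To make the target of this extraction concrete, here is a minimal sketch of how the three pieces of information (relationships between taxa, branch lengths and support values) might be represented once recovered from a figure. It assumes the tree is written out in the widely used Newick text format, which is my illustrative choice and not necessarily the output format of this project. The `Node` class, the `parse_newick` and `leaf_names` functions, the taxon names and all numbers are hypothetical. Reading internal node labels as support values is a common convention, not a universal rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a tree (hypothetical structure, for illustration only)."""
    name: Optional[str] = None       # taxon label, set on leaves
    length: Optional[float] = None   # length of the branch leading to this node
    support: Optional[float] = None  # support value, set on internal nodes
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a simple Newick string.

    Assumes no quoted labels, no comments and no whitespace inside the tree,
    and that labels on internal nodes are numeric support values.
    """
    text = text.strip()
    pos = 0

    def peek() -> str:
        # Current character, or "" once the end of the string is reached.
        return text[pos] if pos < len(text) else ""

    def read_label() -> str:
        # Read characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:;":
            pos += 1
        return text[start:pos]

    def read_node() -> Node:
        nonlocal pos
        node = Node()
        if peek() == "(":
            pos += 1  # consume "("
            node.children.append(read_node())
            while peek() == ",":
                pos += 1  # consume ","
                node.children.append(read_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        label = read_label()
        if label:
            if node.children:
                node.support = float(label)  # convention assumed here
            else:
                node.name = label
        if peek() == ":":
            pos += 1  # consume ":"
            node.length = float(read_label())
        return node

    return read_node()


def leaf_names(node: Node) -> List[str]:
    """Collect taxon names from left to right."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


# Made-up example: (Homo, Pan) grouped with support 95, then Gorilla.
tree = parse_newick("((Homo:0.1,Pan:0.2)95:0.3,Gorilla:0.4);")
print(leaf_names(tree))          # ['Homo', 'Pan', 'Gorilla']
print(tree.children[0].support)  # 95.0
print(tree.children[0].length)   # 0.3
```

The nesting of `children` holds the relationships between taxa, while `length` and `support` hold the numeric annotations that would have to be read off the figure itself.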
There is a huge wealth of phylogenetic data in the literature. I recently co-authored a paper showing that more than 66,000 separate papers containing novel, empirically generated phylogenetic trees have been published in the 21st century alone, and that less than 4% of these data are publicly available in a re-usable form. I, and many others, think these hugely valuable and repurposable data should be kept and made openly available for re-use, hence I'm trying to systematically salvage them from the literature.
Another novel aspect of our approach is that we're mining PDFs rather than publisher-provided XML or HTML. The latter do not contain the figures, just links to them, and thus they can only help us recover metadata on each phylogenetic tree. The PDF is often the only format that contains everything, sometimes in a clearly machine-interpretable form. Peter has been particularly vocal on his blog about the quality, or lack thereof, of PDF files produced by some publishers on behalf of authors.
Personally, I feel that making all supporting data for a paper openly available should be the default, with exceptions allowed only with clear and explicit justification. I'm still surprised, and slightly disappointed, that this isn't yet the norm in scientific publishing; we certainly have the technology to do it.
In 2013 I shall be submitting my PhD thesis, looking for further academic funding or employment, and assuming the new role of Science Community Coordinator at the Open Knowledge Foundation, working to join together the various open science labs around the world. I heartily look forward to meeting everyone at February's pro-iBiosphere workshop.
Ross Mounce
PhD Student & Panton Fellow, University of Bath, United Kingdom